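Transparent objects are widely used in industrial automation and daily life. However, robust visual recognition and perception of transparent objects has always been a major challenge. At present, most commercial-grade depth cameras are still poor at sensing the surfaces of transparent objects because of the refraction and reflection of light. In this work, we present a transformer-based depth estimation method for transparent objects from a single RGB-D input. We observe that the global characteristics of the transformer make it easier to extract contextual information for depth estimation in transparent regions. In addition, to better enhance fine-grained features, a feature fusion module (FFM) is designed to assist coherent prediction. Our empirical evidence shows that our model delivers significant improvements on recent popular datasets compared with previous state-of-the-art convolution-based methods, e.g., a 25% gain on RMSE and a 21% gain on RER. Extensive results show that our transformer-based model better aggregates an object's RGB and inaccurate depth information to obtain a better depth representation. Our code and pre-trained model will be available at https://github.com/yuchendoudou/tode.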
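As a rough illustration of how a figure such as "a 25% gain on RMSE" can be read, the sketch below computes RMSE over a masked region and the relative reduction against a baseline. It is not the authors' evaluation code: the function names, the mask convention (True marks a transparent pixel), the toy depth values, and the reading of "gain" as relative error reduction are all assumptions made for illustration.

```python
import math


def rmse(pred, target, mask):
    """Root-mean-square error over the pixels where mask is True."""
    squared = [(p - t) ** 2 for p, t, m in zip(pred, target, mask) if m]
    if not squared:
        raise ValueError("mask selects no pixels")
    return math.sqrt(sum(squared) / len(squared))


def relative_gain(baseline_error, model_error):
    """Fractional reduction of an error metric relative to a baseline."""
    return (baseline_error - model_error) / baseline_error


# Toy flattened depth maps (hypothetical values, in metres).
target = [1.0, 1.0, 2.0, 2.0]
mask = [True, True, True, False]      # last pixel is outside the transparent region
baseline = [1.4, 0.6, 2.4, 9.0]       # hypothetical baseline prediction
model = [1.3, 0.7, 2.3, 9.0]          # hypothetical improved prediction

baseline_rmse = rmse(baseline, target, mask)
model_rmse = rmse(model, target, mask)
gain = relative_gain(baseline_rmse, model_rmse)
```

Restricting the metric to the mask keeps errors outside the transparent region, such as the last pixel here, from influencing the comparison.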
translated by 谷歌翻译
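Understanding and forecasting the future trajectories of agents is critical for behavior analysis, robot navigation, autonomous cars, and other related applications. Previous methods mostly treat trajectory prediction as time-sequence generation. In contrast, this work studies agents' trajectories from a "vertical" view, i.e., modeling and forecasting trajectories in the spectral domain. Different frequency bands in the trajectory spectrum can hierarchically reflect agents' motion preferences at different scales. The low-frequency and high-frequency portions represent coarse motion trends and fine motion variations, respectively. Accordingly, we propose a hierarchical network, V$^2$-Net, which contains two sub-networks that hierarchically model and predict agents' trajectories using trajectory spectra. The coarse-level keypoint estimation sub-network first predicts the "minimal" spectra of agents' trajectories on several "key" frequency portions. The fine-level spectrum interpolation sub-network then interpolates these spectra to reconstruct the final predictions. Experimental results demonstrate the competitiveness and superiority of V$^2$-Net on both the ETH-UCY benchmark and the Stanford Drone Dataset.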
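The idea that low-frequency bands carry the coarse motion trend can be sketched with a plain discrete Fourier transform. This is only an illustration under stated assumptions, not the V$^2$-Net implementation: the abstract does not specify the transform, and the helper names `dft`, `idft`, `low_pass`, and `coarse_trajectory` are hypothetical.

```python
import cmath


def dft(seq):
    """Naive discrete Fourier transform of a real or complex sequence."""
    n = len(seq)
    return [
        sum(seq[t] * cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spec):
    """Inverse of dft()."""
    n = len(spec)
    return [
        sum(spec[k] * cmath.exp(2j * cmath.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(spec, keep):
    """Zero every bin except the `keep` lowest frequencies and their mirrored bins."""
    n = len(spec)
    return [c if (k <= keep or k >= n - keep) else 0j for k, c in enumerate(spec)]


def coarse_trajectory(points, keep):
    """Rebuild an (x, y) trajectory from only its lowest `keep` frequencies."""
    xs = [p[0] for p in points]
    ys = [p[1] for p in points]
    xs_low = idft(low_pass(dft(xs), keep))
    ys_low = idft(low_pass(dft(ys), keep))
    return [(x.real, y.real) for x, y in zip(xs_low, ys_low)]


# Toy trajectory: steady motion along x with a rapid jitter along y.
trajectory = [(float(t), (-1.0) ** t) for t in range(8)]
trend_only = coarse_trajectory(trajectory, 0)   # keeps only the mean position
lossless = coarse_trajectory(trajectory, 4)     # keeps every frequency bin
```

Raising `keep` adds progressively finer motion detail back in, which loosely mirrors the coarse-then-fine ordering the abstract describes.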
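Trajectory prediction aims to forecast agents' possible future locations given their observations and the video context. It is required by many autonomous platforms and applications, such as tracking, detection, robot navigation, self-driving cars, and many other computer vision applications. Agents' internal personality factors, their interactive behaviors with neighbors, and the influence of the surroundings can all affect their future plans. However, many previous methods model and predict agents' behaviors with the same strategy or a "single" feature distribution, which makes it hard for them to give predictions with sufficient style differences. This manuscript proposes the Multi-Style Network (MSN), which uses two sub-networks, style hypothesis and stylized prediction, to jointly give agents multi-style predictions in a novel categorical way. We use agents' end-point plans and their interaction context as the basis for behavior classification, so that multiple diverse behavior styles are learned adaptively through a series of style channels in the network. We then assume that the target agents will plan their future behaviors according to each of these categorized styles, and use the different style channels to give, in parallel, a series of predictions with significant style differences. Experiments show that the proposed MSN outperforms recent state-of-the-art methods by 10\%-20\% on two widely used datasets and exhibits better multi-style characteristics qualitatively.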
In this work, we focus on instance-level open vocabulary segmentation, intending to expand a segmenter for instance-wise novel categories without mask annotations. We investigate a simple yet effective framework with the help of image captions, focusing on exploiting thousands of object nouns in captions to discover instances of novel classes. Rather than adopting pretrained caption models or using massive caption datasets with complex pipelines, we propose an end-to-end solution from two aspects: caption grounding and caption generation. In particular, we devise a joint Caption Grounding and Generation (CGG) framework based on a Mask Transformer baseline. The framework has a novel grounding loss that performs explicit and implicit multi-modal feature alignments. We further design a lightweight caption generation head to allow for additional caption supervision. We find that grounding and generation complement each other, significantly enhancing the segmentation performance for novel categories. We conduct extensive experiments on the COCO dataset with two settings: Open Vocabulary Instance Segmentation (OVIS) and Open Set Panoptic Segmentation (OSPS). The results demonstrate the superiority of our CGG framework over previous OVIS methods, achieving a large improvement of 6.8% mAP on novel classes without extra caption data. Our method also achieves over 15% PQ improvements for novel classes on the OSPS benchmark under various settings.
Recent studies have shown that using an external Language Model (LM) benefits end-to-end Automatic Speech Recognition (ASR). However, predicting tokens that appear less frequently in the training set is still quite challenging. Long-tail prediction problems have been widely studied in many applications, but have been addressed by only a few studies for ASR and LMs. In this paper, we propose a new memory-augmented, lookup-dictionary-based Transformer architecture for LMs. The newly introduced lookup dictionary incorporates rich contextual information from the training set, which is vital for correctly predicting long-tail tokens. In extensive experiments on Chinese and English data sets, our proposed method outperforms the baseline Transformer LM by a large margin on both word/character error rate and tail-token error rate. This is achieved without affecting decoding efficiency. Overall, we demonstrate the effectiveness of our proposed method in boosting ASR decoding performance, especially for long-tail tokens.
In cluster analysis, it is crucial to evaluate clustering quality and determine the optimal number of clusters. In this paper, a multi-granularity characterization of the data set is carried out to obtain hyper-balls. An internal cluster evaluation index based on hyper-balls (HCVI) is defined. Moreover, a general method for determining the optimal number of clusters based on HCVI is proposed. The proposed methods can evaluate the clustering results produced by several classic methods and determine the optimal number of clusters for data sets containing noise and clusters of arbitrary shape. Experimental results on synthetic and real data sets indicate that the new index outperforms existing ones.
Generalizability to unseen forgery types is crucial for face forgery detectors. Recent works have made significant progress in generalization through synthetic forgery data augmentation. In this work, we explore another path for improving generalization. Our goal is to reduce the features that are easy to learn in the training phase, so as to reduce the risk of overfitting to specific forgery types. Specifically, in our method, a teacher network takes face images as input and generates an attention map of the deep features using a diverse multi-head attention ViT. The attention map is used to guide a student network to focus on the low-attended features by reducing the highly attended deep features. A deep feature mixup strategy is also proposed to synthesize forgeries in the feature domain. Experiments demonstrate that, without data augmentation, our method achieves promising performance on unseen forgeries and highly compressed data.
This paper presents a novel framework for planning in unknown and occluded urban spaces. We specifically focus on turns and intersections where occlusions significantly impact navigability. Our approach uses an inpainting model to fill in a sparse, occluded, semantic lidar point cloud and plans dynamically feasible paths for a vehicle to traverse the open and inpainted spaces. We demonstrate our approach using a car's lidar data with real-time occlusions and show that, by inpainting occluded areas, we can plan longer paths with more turn options than without inpainting. In addition, our approach follows paths derived from a planner with no occlusions (called the ground truth) more closely than other state-of-the-art approaches.
In this work, we investigate improving the generalizability of GAN-generated image detectors by performing data augmentation in the fingerprint domain. Specifically, we first separate the fingerprints and contents of the GAN-generated images using an autoencoder based GAN fingerprint extractor, followed by random perturbations of the fingerprints. Then the original fingerprints are substituted with the perturbed fingerprints and added to the original contents, to produce images that are visually invariant but with distinct fingerprints. The perturbed images can successfully imitate images generated by different GANs to improve the generalization of the detectors, which is demonstrated by the spectra visualization. To our knowledge, we are the first to conduct data augmentation in the fingerprint domain. Our work explores a novel prospect that is distinct from previous works on spatial and frequency domain augmentation. Extensive cross-GAN experiments demonstrate the effectiveness of our method compared to the state-of-the-art methods in detecting fake images generated by unknown GANs.
The rapid development of remote sensing technologies has gained significant attention due to their ability to accurately localize, classify, and segment objects from aerial images. These technologies are commonly used in unmanned aerial vehicles (UAVs) equipped with high-resolution cameras or sensors to capture data over large areas. This data is useful for various applications, such as monitoring and inspecting cities, towns, and terrains. In this paper, we present a method for classifying and segmenting dashed traffic lines on city roads from aerial images using deep learning models such as U-Net and SegNet. Annotated data is used to train these models, which then classify and segment each aerial image into two classes: dashed lines and non-dashed lines. However, a deep learning model may fail to identify all dashed lines because of poor painting or occlusion by trees or shadows. To address this issue, we propose a method to add missed lines to the segmentation output. We also extract the x and y coordinates of each dashed line from the segmentation output, which city planners can use to construct a CAD file for digital visualization of the roads.
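One simple way to turn a binary segmentation output into per-line coordinates is to label connected components and report a centroid for each. The sketch below is a hedged illustration, not the paper's procedure: the abstract does not say how coordinates are extracted, and the function name `component_centroids`, the 4-connectivity choice, and the toy mask are assumptions.

```python
from collections import deque


def component_centroids(mask):
    """Return the (x, y) centroid of every 4-connected blob of 1s in a binary mask."""
    height = len(mask)
    width = len(mask[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    centroids = []
    for y0 in range(height):
        for x0 in range(width):
            if not mask[y0][x0] or seen[y0][x0]:
                continue
            # Breadth-first flood fill collects all pixels of one blob.
            queue = deque([(x0, y0)])
            seen[y0][x0] = True
            xs, ys = [], []
            while queue:
                x, y = queue.popleft()
                xs.append(x)
                ys.append(y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    inside = 0 <= nx < width and 0 <= ny < height
                    if inside and mask[ny][nx] and not seen[ny][nx]:
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            centroids.append((sum(xs) / len(xs), sum(ys) / len(ys)))
    return centroids


# Toy mask: 1 marks a "dashed line" pixel, 0 marks everything else.
toy_mask = [
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 0],
    [0, 0, 0, 1, 1],
]
dash_centers = component_centroids(toy_mask)
```

The centroids are in pixel coordinates. Converting them to map or CAD coordinates would need the image's georeferencing, which is outside this sketch.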